Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML dict: enable color rendering and per-dict user fix html func #3585

Merged
merged 1 commit into from Jan 9, 2018

Conversation

poire-z
Copy link
Contributor

@poire-z poire-z commented Jan 9, 2018

  • Enable Mupdf color rendering in HtmlBoxWidget.
  • Fix possibility of crash in HtmlBoxWidget:free() (I don't think a crash can happen with the current master code, but I have some work in progress that uses more free() to prevent memleaks and I had some crashes there, so let's fix it now).
  • Allow for user's hooks for HTML dict to tweak/fix the bad HTML of some dicts, by allowing users to put a LUA function in some file named as the .ifo with a .lua extension (HTML dictionary support #3573 (comment))

With a painfully crafted PetitRobert2007_1.1.lua (updated 20180115 ), I managed to turn this:
robert2007bof
(the HTML returned is really really bad, full of errors and inconsistencies)

into:
robert2007html

Too bad I won't be using it, as it turns out this PetitRobert2007 does not return all definitions (ie, you get 2. Critique, the 2nd definition for word critique, that suggest to go see 1. Critique, but you never get it in the results...) edit: just realized the PetitRobert2003 returns the 1. Critique but not any 2. Critique ... so it's either a bug when these 2 dicts were made, or something in sdcv...
edit: ok, it's a sdcv bug already reported in Dushistov/sdcv#30

(The collect/serialize to fix HTML tags balance I used in this file seems OK, but there's many chances that it won't work as is and need individual tweaks - so I don't think we can and should make use of it as an alternative to htmltidy by default for all HTML output that Mupdf woudln't like).

(@Frenzie : you can release this morning's builds without this with no problem.)

Also fix possibility of crash in HtmlBoxWidget:free()
@Frenzie
Copy link
Member

Frenzie commented Jan 9, 2018

I'm not entirely sure that I can just yet. :-P

On my H2O I found I wasn't able to scroll, even though it's working on the emulator.

@poire-z
Copy link
Contributor Author

poire-z commented Jan 9, 2018

Confirmed: scrolling with swipe does not work in scrollhtmlwidget (I scroll with tap mostly). It does not scroll for me either in the emulator.
Small typo (a natural/visual half-conscious corretion I guess). Easy to fix:

     if Device:isTouchDevice() then
         self.ges_events = {
-            SwipeScrollText = {
+            ScrollText = {
                 GestureRange:new{

Should I add it to this PR ? or you prefer an individual one?

@Frenzie
Copy link
Member

Frenzie commented Jan 9, 2018

I suppose a separate quickie is better for history.

@Frenzie Frenzie merged commit 8c897a0 into koreader:master Jan 9, 2018
@Frenzie
Copy link
Member

Frenzie commented Jan 9, 2018

Another crash that I figure should theoretically be avoided is the error throwing, i.e., sticking it behind an if dbg.is_on then and doing something a little more user-friendly otherwise.

if not ok then
error(self.document)
end

But that's more of a hypothetical thing. Since I'm not completely sure what the best friendly thing to do is, other than not closing the program[1] I won't throw in a quickie for that one and I won't let it hold up anything tomorrow. ;-)

[1] Retry yet again with known good HTML saying something about an error? Close the widget and show some kind of InfoMessage instead? Leave it white with an InfoMessage? Some other option?

@poire-z poire-z deleted the htmldict_color_fixfunc branch January 9, 2018 21:03
@poire-z
Copy link
Contributor Author

poire-z commented Jan 9, 2018

I don't really know.
But I assume that with the html to plain text conversion just done before, there's no real reason for it to fail (except for other things like not enough memory, etc...), so we should be fine.

@poire-z
Copy link
Contributor Author

poire-z commented Jan 29, 2018

Sharing the html fix funcs I made for the 3 HTML french dicts I have (to be put alongside the .ifo and renamed as it with a .lua extension instead of .ifo):
PetitRobert2007_1.1.lua.txt
PetitRobert2003_v4.lua.txt
XMLittre.lua.txt

@Frenzie
Copy link
Member

Frenzie commented Jan 29, 2018

We should look into a sensible way of including the XML Littré fix. (What's the problem with something that explicitly has XML in the name? Some custom XML DTD rather than HTML?)

@poire-z
Copy link
Contributor Author

poire-z commented Jan 29, 2018

Yes, some XML which is not fully HTML. For example, in my .lua:

    html = html:gsub('<big>', '<span style="font-weight: bold; font-size: 1.1em">')
    html = html:gsub(' foreground="', ' style="color: ')

(+ some other cosmetics changes for my taste, like replacing colors with smaller italic fonts, etc...)
(But I'm not not sure I didnt explicitely changed sametypesequence=x to sametypesequence=h - so it may by default displays as text.)
(Or may be it was bad HTML, and the conversion to simple HTML is OK enough to have it looks as text like it did before - and I must have by chance see it was indeed some HTML, and tried to make something better out of it.)

@Frenzie
Copy link
Member

Frenzie commented Jan 29, 2018

Hm, how old is that dictionary? It's being continuously corrected, although it looks like it remains an XML version of ye olde Littré, see here.

Is there a script somewhere that automates the conversion from the XML Littré source to stardict? There's https://bitbucket.org/Mytskine/xmlittre-web but I'm not seeing any XSLT in that repo.

https://www.littre.org/faq

Anyway, just something that one could consider as a lunch or breakfast project. :-P

@poire-z
Copy link
Contributor Author

poire-z commented Jan 29, 2018

Oh, didn't know that. Files in mine are dated from 2006 :)

@Frenzie
Copy link
Member

Frenzie commented Feb 11, 2018

Those curious could investigate whether https://github.com/leafo/web_sanitize could be of any use. Although its purpose is to sanitize (i.e., from scripts and styles), it seems to handle unclosed inner tags. It's a fair bit smaller than libtidy. :-P

It depends on LPeg, but we already include that for lua-Spore.

@mtlive
Copy link

mtlive commented Apr 22, 2019

Note that MuPDF support for xhtml is limited.
For example it doesn't display some items nested in <table>, even though validated by tidy.

@mtlive
Copy link

mtlive commented Apr 23, 2019

I write a lua script to [test if it is possible to] correctly render persian/arabic strings in definitions, but per my tests koreader even doesn't load my script. May you help me please?
persian.lua.txt

@Frenzie
Copy link
Member

Frenzie commented Apr 23, 2019

MuPDF should already do that?

If you're talking about the KOReader GUI elements they don't, but I'd probably look into leveraging HarfBuzz. Last year we didn't include it, but now we do.

https://github.com/luapower/harfbuzz
https://github.com/luapower/harfbuzz/blob/a4fa5884950166248b78d768eb806bb413ef4cdf/harfbuzz_demo.lua#L162-L163

(Also: https://github.com/ufyTeX/luaharfbuzz)

per my tests koreader even doesn't load my script

You didn't say what you're trying to load and use it.

@mtlive
Copy link

mtlive commented Apr 23, 2019

MuPDF should already do that?

No it doesn't. Letters are seperated and left to right.
Currently I'm working on fixing dictionary results. But regarding rendering books, harfbuzz and mupdf are painfully slow and generated results aren't useful. We may talk about it in a related issue topic.
I'll post more details tomorrow.

@Frenzie
Copy link
Member

Frenzie commented Apr 23, 2019

No it doesn't.

What exactly is it that it doesn't do in what scenario? See, e.g., #1426 (comment)

@mtlive
Copy link

mtlive commented Apr 24, 2019

I found the reason, the script is run only when the dictionary .ifo has sametypesequence=h . But unforunately MuPDF displays persian characters as ▯, while without MuPDF at least characters are displayed.

What exactly is it that it doesn't do in what scenario?

Here are the screenshots:
Plain text dictionary:
farsi-plain
HTML dictionary:
farsi-html

@Frenzie
Copy link
Member

Frenzie commented Apr 24, 2019

That could be an issue somewhere in how we build or use MuPDF. I'm reasonably sure it's supposed to fall back to Noto.

@mtlive
Copy link

mtlive commented Apr 24, 2019

I even tried to set the font explicitly through css but it seems that MuPDF doesn't respect the font-family attribute at all. (verified on PC)

@Frenzie
Copy link
Member

Frenzie commented Apr 24, 2019

Setting it where and how? If you're testing MuPDF I would stick to the basics (e.g., a simple XHTML file) without complicating matters through the dictionary. ;-)

If you want to test the dictionary, see, e.g.:

local css = [[
@page {
margin: 0;
font-family: 'Noto Sans';
}
body {
margin: 0;
line-height: 1.3;
]]..css_justify..[[
}
]]

Which makes it really weird that we're seeing undefined characters; maybe it's actually some issue in the sdcv output or how it's passed on?

It definitely listens to the font there, because otherwise you'd be seeing some serif typeface instead of Noto Sans. :-)

@NiLuJe
Copy link
Member

NiLuJe commented Apr 24, 2019

My guess is that output is plain garbage, and not MuPDF failing to display Farsi ;).

(i.e., I'd double-check the sanity of the dictionary, and try the sdcv CLI, as @Frenzie suggested).

@mtlive
Copy link

mtlive commented Apr 24, 2019

tried a css with
@page { margin: 0; font-family: freeserif; }

Now farsi characters are displayed as free space (or nothing!). I'm sure that characters are there, because when I touch and hold on them koreader will seek for their definition and displays them there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants